
    Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

    Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs high computational cost, quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASFormer employs only sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve performance competitive with vanilla transformers with full attention while significantly reducing computational cost (by up to 75%). Additionally, we investigate the effectiveness of continual training with long-sequence data and how sequence length impacts downstream generation performance, which may be of independent interest. Comment: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023, Findings).
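    A minimal sketch of the mixed-attention-span idea, not the authors' code: a configurable subset of decoder layers gets a full causal mask while the remaining layers get a sliding-window causal mask. The layer indices, window size, and sequence length below are illustrative assumptions.

```python
# Sketch of mixed attention spans (illustrative, not the MASFormer release).
# Assumption: full attention at a few layers captures long-range dependencies;
# the remaining layers use a fixed-span (sliding-window) causal mask.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks positions that must NOT be attended to.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # Causal mask restricted to an attention span of `window` past tokens.
    idx = torch.arange(seq_len)
    too_far = (idx[:, None] - idx[None, :]) >= window
    return causal_mask(seq_len) | too_far

def layer_masks(num_layers: int, full_layers: set, seq_len: int, window: int):
    # One mask per layer: full attention only at the chosen layers.
    return [causal_mask(seq_len) if i in full_layers
            else local_causal_mask(seq_len, window)
            for i in range(num_layers)]

# Example: 24 layers, full attention only at layers 0 and 1, span 256 elsewhere.
masks = layer_masks(num_layers=24, full_layers={0, 1}, seq_len=1024, window=256)
```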

    Sparse Subspace Modeling for Query by Example Spoken Term Detection

    We cast the problem of query by example spoken term detection (QbE-STD) as subspace detection, where the query and background are modeled as unions of low-dimensional subspaces. The speech exemplars used for subspace modeling consist of class-conditional posterior probabilities obtained from a deep neural network (DNN). The query and background training exemplars are exploited to model the underlying low-dimensional subspaces through dictionary learning and sparse coding. Given the dictionaries characterizing the query and background speech, QbE-STD amounts to subspace detection via sparse representation, and the reconstruction error is used for binary classification. Furthermore, we rigorously investigate the relationship between the proposed method and the generalized likelihood ratio test. The experimental evaluation demonstrates that the proposed method is able to detect the query given a single exemplar and performs significantly better than one of the best QbE-STD baseline systems based on template matching.
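    A minimal sketch of the detection recipe under stated assumptions, not the authors' pipeline: posterior vectors are the rows of the exemplar matrices, one dictionary is learned per class with scikit-learn, and a frame is classified by comparing the two sparse-reconstruction errors. All sizes and sparsity levels are illustrative.

```python
# Sketch of QbE-STD as subspace detection via dictionary learning.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def learn_dictionary(exemplars: np.ndarray, n_atoms: int = 20) -> np.ndarray:
    # Rows of `exemplars` are DNN posterior feature vectors.
    return DictionaryLearning(n_components=n_atoms,
                              transform_algorithm="omp").fit(exemplars).components_

def recon_error(x: np.ndarray, dictionary: np.ndarray, k: int = 5) -> float:
    # Sparse-code x over the dictionary and measure the residual.
    code = sparse_encode(x[None, :], dictionary,
                         algorithm="omp", n_nonzero_coefs=k)
    return float(np.linalg.norm(x - code @ dictionary))

# Binary classification: a frame belongs to the query if the query
# dictionary reconstructs it better than the background dictionary does.
rng = np.random.default_rng(0)
D_query = learn_dictionary(rng.random((200, 40)))
D_background = learn_dictionary(rng.random((500, 40)))
frame = rng.random(40)
is_query = recon_error(frame, D_query) < recon_error(frame, D_background)
```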

    Subspace Regularized Dynamic Time Warping for Spoken Query Detection

    Deep neural network posterior probabilities are the best features for query detection in speech archives, and dynamic time warping (DTW) is the state-of-the-art solution for this task. Posterior features live in low-dimensional subspaces, whereas current DTW methods do not incorporate this global structure of the data and rely on local feature distances. We exploit the query example as the dictionary for sparse recovery. Local DTW scores are integrated with the sparse reconstruction scores to obtain a subspace-regularized distance matrix for DTW. The proposed method yields a substantial performance gain over the baseline system.
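    A minimal sketch of the regularization idea, not the exact formulation: the query's own posterior frames serve as the sparse-recovery dictionary, each archive frame's reconstruction error is added to the local distance matrix, and a plain DTW is run on the result. The cosine distance, the weight `alpha`, and the sparsity level are assumptions.

```python
# Sketch of subspace-regularized DTW (illustrative combination rule).
import numpy as np
from sklearn.decomposition import sparse_encode

def regularized_distance(query: np.ndarray, doc: np.ndarray,
                         alpha: float = 0.5, k: int = 5) -> np.ndarray:
    # Local cosine distances between query and document posterior frames.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    local = 1.0 - q @ d.T                                 # (T_query, T_doc)
    # Sparse recovery with the query frames themselves as the dictionary.
    codes = sparse_encode(doc, query, algorithm="omp", n_nonzero_coefs=k)
    recon = np.linalg.norm(doc - codes @ query, axis=1)   # per document frame
    return local + alpha * recon[None, :]                 # regularized matrix

def dtw_cost(dist: np.ndarray) -> float:
    # Standard cumulative-cost DTW over the regularized distance matrix.
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)   # length-normalized alignment cost
```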

    Integration of Real-Time Speech Processing Technologies for Online Gaming

    This work demonstrates an application of different real-time speech technologies in an online gaming scenario. The game developed for this purpose is inspired by the famous television quiz show "Who Wants to Be a Millionaire", in which multiple-choice questions of increasing difficulty are posed to the participant. Text-to-speech synthesis is used to read out the questions and the possible answers to the user, while an automatic speech recognition engine is exploited to get input from the player in order to proceed through the game. The speech data is recorded from the user with the help of a real-time voice activity detector that selects speech segments from the input audio. The developed Java application allows automatic insertion of new multiple-choice questions of different complexity, which can then be selected during the game.
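    A minimal sketch of how the described components interact in one game round. The actual application is a Java program; the hypothetical `tts`, `vad`, and `asr` objects and their methods below are assumptions used purely to illustrate the pipeline.

```python
# Sketch of the quiz-game speech loop (hypothetical component interfaces).
def play_round(question, tts, vad, asr) -> bool:
    # Read the question and the multiple-choice answers aloud via TTS.
    tts.speak(question.text)
    for label, option in question.options.items():   # e.g. "A" -> "Paris"
        tts.speak(f"{label}: {option}")
    # The VAD selects speech segments from the input audio in real time,
    # and the ASR engine recognizes the player's spoken choice.
    segment = vad.capture_speech()       # hypothetical real-time VAD call
    answer = asr.recognize(segment)      # hypothetical ASR engine call
    return answer.strip().upper() == question.correct_label
```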

    Redundant Hash Addressing for Large-Scale Query by Example Spoken Query Detection

    State-of-the-art query by example spoken term detection (QbE-STD) systems rely on representing speech as sequences of class-conditional posterior probabilities estimated by a deep neural network (DNN). The posteriors are often used for pattern matching or dynamic time warping (DTW). Exploiting posterior probabilities as the speech representation offers diverse advantages in a classification system. One key property of the posterior representation is that it admits a highly effective hashing strategy that enables indexing the large archive in divisions, reducing the search complexity. Moreover, posterior indexing leads to a compressed representation and enables pronunciation dewarping and partial detection with no need for DTW. We exploit these characteristics of the posterior space in the context of redundant hash addressing for QbE-STD. We evaluate the QbE-STD system on the AMI corpus and demonstrate that a tremendous speedup and superior accuracy are achieved compared to the state-of-the-art pattern matching and DTW solutions. The system has great potential to enable massively large-scale query detection.
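    A minimal sketch of the indexing idea under stated assumptions: the key construction below (the identities of each frame's top-k most probable phone classes) is an illustrative choice, not the paper's exact addressing scheme. Because similar pronunciations collide into the same bucket, a query only touches a few divisions of the archive instead of triggering DTW over all of it.

```python
# Sketch of hash addressing over posterior frames for indexed search.
from collections import defaultdict
import numpy as np

def hash_key(posterior: np.ndarray, k: int = 2) -> tuple:
    # Top-k phone-class indices, order-normalized, as a hash address.
    return tuple(sorted(np.argsort(posterior)[-k:]))

def build_index(archive_frames: np.ndarray) -> dict:
    # Index each archive frame's time position in its bucket.
    index = defaultdict(list)
    for t, frame in enumerate(archive_frames):
        index[hash_key(frame)].append(t)
    return index

def candidate_frames(query_frames: np.ndarray, index: dict) -> set:
    # Only the buckets addressed by the query need to be examined.
    hits = set()
    for frame in query_frames:
        hits.update(index.get(hash_key(frame), ()))
    return hits
```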

    Subspace Detection of DNN Posterior Probabilities via Sparse Representation for Query by Example Spoken Term Detection

    We cast the query by example spoken term detection (QbE-STD) problem as subspace detection, where the query and background subspaces are modeled as unions of low-dimensional subspaces. The speech exemplars used for subspace modeling are class-conditional posterior probabilities estimated using a deep neural network (DNN). The query and background training exemplars are exploited to model the underlying low-dimensional subspaces through dictionary learning for sparse representation. Given the dictionaries characterizing the query and background subspaces, QbE-STD is performed based on the ratio of the two corresponding sparse-representation reconstruction errors. The proposed subspace detection method can be formulated as the generalized likelihood ratio test for composite hypothesis testing. The experimental evaluation demonstrates that the proposed method is able to detect the query given a single example and performs significantly better than a highly competitive QbE-STD baseline system based on template matching.
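    A minimal sketch of the ratio-based score, assuming dictionaries D_q (query) and D_b (background) have already been learned as above. Under the paper's generalized-likelihood-ratio reading, thresholding this ratio decides between the query and background hypotheses; the sparsity level and the threshold are assumptions.

```python
# Sketch of the reconstruction-error ratio test (illustrative).
import numpy as np
from sklearn.decomposition import sparse_encode

def detection_score(x: np.ndarray, D_q: np.ndarray, D_b: np.ndarray,
                    k: int = 5) -> float:
    # Sparse-reconstruction error of frame x under each dictionary.
    err = {}
    for name, D in (("q", D_q), ("b", D_b)):
        code = sparse_encode(x[None, :], D, algorithm="omp", n_nonzero_coefs=k)
        err[name] = np.linalg.norm(x - code @ D)
    # Large ratio => background explains x poorly relative to the query.
    return err["b"] / (err["q"] + 1e-12)   # score > threshold => detection
```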

    Sparse Pronunciation Codes for Perceptual Phonetic Information Assessment

    Speech is a complex signal produced by highly constrained articulation machinery. Neuro- and psycholinguistic theories assert that speech can be decomposed into molecules of structured atoms. Although the characterization of these atoms is controversial, experiments support the notion of invariant speech codes governing speech production and perception. We exploit deep neural network (DNN) invariant representation learning for probabilistic characterization of the phone attributes, defined in terms of the phonological classes and known as the smallest perceptual categories. We cast speech perception as a channel for phoneme information transmission via the phone attributes. Structured sparse codes are identified from the phonological probabilities for natural speech pronunciation. We exploit the sparse codes in information transmission analysis for assessment of phoneme pronunciation. Linguists define a single binary phonological code per phoneme. In contrast, probabilistic estimation of the phonological classes enables us to capture large variation in the structure of speech pronunciation. Hence, speech assessment need not be confined to the single expert-knowledge-based mapping between phonemes and phonological classes; it can be extended to multiple data-driven mappings observed in natural speech.
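    A minimal sketch of the contrast the abstract draws, under stated assumptions: binarizing per-frame phonological-class posteriors (the 0.5 cutoff is an illustrative choice) yields data-driven pronunciation codes, and the spread of distinct codes observed for one phoneme captures variation that a single expert-defined binary code cannot.

```python
# Sketch of extracting structured sparse pronunciation codes (illustrative).
import numpy as np
from collections import Counter

def sparse_codes(phono_posteriors: np.ndarray, thr: float = 0.5):
    # One binary phonological code per frame
    # (rows: frames, cols: phonological classes).
    return [tuple((row > thr).astype(int)) for row in phono_posteriors]

def code_distribution(phono_posteriors: np.ndarray) -> Counter:
    # Multiple data-driven codes may be observed for a single phoneme,
    # instead of the one expert-defined phoneme-to-class mapping.
    return Counter(sparse_codes(phono_posteriors))
```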

    Language Independent Query by Example Spoken Term Detection

    Language independent query-by-example spoken term detection (QbE-STD) is the problem of retrieving audio documents from an archive which contain a spoken query provided by a user. This is usually cast as a hypothesis testing and pattern matching problem, also referred to as a "zero-resource task" since no specific training or lexical information is required to represent the spoken query. It thus enables multilingual search on unconstrained speech without requiring a full speech recognition system. State-of-the-art solutions typically rely on Dynamic Time Warping (DTW) based template matching using phone posterior features estimated by Deep Neural Networks (DNN). In this thesis, we aim to exploit the low-dimensional subspace structure of the speech signal that results from the constrained human speech production process. We exploit this subspace structure to improve over the state of the art by (1) generating better phone or phonological posterior features and (2) improving the matching algorithm. To enhance phone posteriors, we learn the underlying phonetic subspaces in an unsupervised way, and use the sub-phonetic attributes to extract the phonological components in a supervised manner. To improve the matching algorithm, we model the subspaces of the spoken query using its phone posterior representation. The resulting model is used to compute distances between the subspaces of the query and the phone posteriors of each audio document. These distances are then used to detect occurrences of the spoken query, while also regularizing the DTW to improve the detection scores. In addition to optimizing different components of the state-of-the-art system, we propose a novel DNN-based QbE-STD system that provides an end-to-end learning framework. Towards that end, we replace the DTW-based matching with a Convolutional Neural Network (CNN) architecture. We also learn multilingual features aimed at obtaining a language independent representation. Finally, we integrate the feature learning and CNN-based matching to jointly train and further improve QbE-STD performance. We perform experiments using the challenging AMI meeting corpus (English), as well as multilingual datasets such as Spoken Web Search 2013 and Query by Example Search on Speech Task 2014, and show significant improvements over a very competitive state-of-the-art system.
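    A minimal sketch of the CNN-based matching stage, not the thesis' exact network: the query and document posterior sequences are turned into a 2D similarity image, and a small CNN scores whether the image contains the quasi-diagonal pattern of a query occurrence. Filter counts, kernel sizes, and tensor shapes below are assumptions.

```python
# Sketch of CNN matching over a query/document similarity image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16, 1)

    def forward(self, sim):                   # sim: (batch, 1, T_query, T_doc)
        h = F.max_pool2d(F.relu(self.conv1(sim)), 2)
        h = F.max_pool2d(F.relu(self.conv2(h)), 2)
        h = h.mean(dim=(2, 3))                # global average pooling
        return torch.sigmoid(self.head(h))    # detection probability

def similarity_image(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    # Frame-wise cosine similarities between query and document posteriors.
    q = F.normalize(query, dim=1)             # (T_query, n_phones)
    d = F.normalize(doc, dim=1)               # (T_doc, n_phones)
    return (q @ d.T)[None, None]              # (1, 1, T_query, T_doc)

# Toy usage with random "posteriors": 40 query frames, 300 document frames.
score = MatchCNN()(similarity_image(torch.rand(40, 50), torch.rand(300, 50)))
```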